
gh-142183: Change data stack to use a resizable array#148681

Open
dpdani wants to merge 5 commits into python:main from dpdani:gh-142183-stack-resizable-array

Conversation

@dpdani
Contributor

@dpdani dpdani commented Apr 17, 2026

This PR changes the implementation of the Python stack to use a resizable array. This avoids the problem of calls that frequently cause the datastack_top (now called stack_top) pointer to switch between allocations.

After resizing, previous array allocations are not immediately freed, because various parts of the VM may still hold pointers into them; instead, they are freed along with the tstate.

During resizing, the previous contents of the stack are not copied into the new allocation; the memory of the previous allocation remains in use. As frames are subsequently popped and pushed, new frames always reside in the new stack chunk allocation.

Overall it results in a ±1% performance change (within the noise range), but it avoids degenerate cases for any number of frames. I am also told it would allow further optimizations in the JIT.
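To make the scheme concrete, here is a toy Python model of the behavior described above. The names and the doubling policy are illustrative, not CPython's actual code: old chunks are retained until teardown, frames are never copied across chunks, and new frames always land in the newest chunk.

```python
class ToyStack:
    """Toy model of a resizable frame stack (illustrative names only).

    On overflow a larger chunk is allocated; existing frames are NOT
    copied into it (they stay live in older chunks), and old chunks are
    retained until teardown, mirroring the PR's description.
    """

    def __init__(self, initial_capacity=4):
        self.chunks = [[None] * initial_capacity]  # newest chunk is last
        self.depth = 0                             # current logical depth

    def push(self, frame):
        newest = self.chunks[-1]
        if self.depth >= len(newest):
            # Double capacity; slots below the current depth in the new
            # chunk stay unused -- those frames remain in older chunks.
            newest = [None] * (len(newest) * 2)
            self.chunks.append(newest)
        newest[self.depth] = frame
        self.depth += 1

    def pop(self):
        if self.depth == 0:
            raise IndexError("pop from empty stack")
        self.depth -= 1
        # The frame lives in the newest chunk that has a filled slot at
        # this depth (frames pushed before a resize sit in older chunks).
        for chunk in reversed(self.chunks):
            if self.depth < len(chunk) and chunk[self.depth] is not None:
                frame, chunk[self.depth] = chunk[self.depth], None
                return frame
        raise IndexError("corrupted stack")
```

With an initial capacity of 2, five pushes end up spread across three chunks, and all three chunks stay allocated even after the stack is fully popped.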

@python-cla-bot

python-cla-bot Bot commented Apr 17, 2026

All commit authors signed the Contributor License Agreement.

CLA signed

@pablogsal pablogsal self-assigned this Apr 17, 2026
@dpdani dpdani requested review from ambv and lysnikolaou as code owners April 20, 2026 12:33
@dpdani
Contributor Author

dpdani commented Apr 27, 2026

@pablogsal can you take a look this week? 🙏

@pablogsal
Member

@pablogsal can you take a look this week? 🙏

I can try but I have some other PRs first in my review queue :(

Member

@pablogsal pablogsal left a comment


Thanks for working on this. I am worried about a couple of consequences that I think we should account for before continuing with this:

A minor concern is that this seems to change Python stack memory from “roughly current depth” usage to “high-water mark for the lifetime of the thread”. In the old chunked implementation, deep recursion allocated additional 16 KiB chunks and then released most of them while unwinding. In this version, resize_stack() keeps previous stack chunks linked from stack_chunk_list, and _PyThreadState_PopFrame() only moves stack_top; the chunks are not freed until the thread state is deleted.
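For a rough sense of scale, assuming a doubling growth policy starting from a 16 KiB chunk (an assumption for illustration; the actual policy lives in resize_stack()), the retained high-water-mark memory stays bounded by roughly twice the largest chunk:

```python
# Geometric growth means total retained memory is less than about 2x the
# peak chunk size, even though nothing is freed until the thread state
# dies.  (Doubling from 16 KiB is an assumed policy, for illustration.)
KIB = 1024
chunks = [16 * KIB]
while sum(chunks) < 4 * 1024 * KIB:  # grow until ~4 MiB of stack is needed
    chunks.append(chunks[-1] * 2)

retained = sum(chunks)
largest = chunks[-1]
# Sum of a doubling series: retained = 2*largest - chunks[0] < 2*largest.
assert retained < 2 * largest
print(len(chunks), retained // KIB)
```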

This doesn't seem to be a lot so I am not too worried.

My bigger concern is the profiler/debugger consequences. The old stack chunk layout allowed _remote_debugging/external unwinders to bulk-copy stack chunks cheaply. With this change, the active frame chain can span older chunks while only the newest chunk is copied in the new _remote_debugging path, so older frames fall back to individual remote reads. For a 1000-frame stack I measured:

  • old no-cache unwinding: 4 memory reads, ~1.2 KiB read
  • new no-cache unwinding: 966 memory reads, ~85.8 KiB read

Tachyon is probably mostly insulated because it uses frame caching, but first samples, cache-disabled paths, fallback paths, and external tools still care. Other profilers like austin currently hard-code the old _PyStackChunk layout (previous, size, top, data), while this patch changes it to (size, previous, data), so those tools need explicit updates.
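To illustrate why hard-coded offsets break, here are hypothetical ctypes mirrors of the two layouts. The field types are guesses at pointer-sized members for the sake of the example, not CPython's real definitions:

```python
import ctypes

# Hypothetical mirrors of the old and new layouts discussed above.
class OldStackChunk(ctypes.Structure):
    _fields_ = [("previous", ctypes.c_void_p),   # old layout: pointer first
                ("size", ctypes.c_size_t),
                ("top", ctypes.c_size_t),
                ("data", ctypes.c_void_p * 1)]   # stand-in for flexible array

class NewStackChunk(ctypes.Structure):
    _fields_ = [("size", ctypes.c_size_t),       # new layout: size first
                ("previous", ctypes.c_void_p),
                ("data", ctypes.c_void_p * 1)]

# A tool that baked in "previous is at offset 0" silently reads `size`
# as a pointer under the new layout.
print(OldStackChunk.previous.offset, NewStackChunk.previous.offset)
```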

Given this and the limited potential gains, I do not find the tradeoff very convincing.

@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented Apr 27, 2026

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Comment thread: Modules/_remote_debugging/frames.c (Outdated)
chunk_addr = GET_MEMBER(uintptr_t, chunks[count].local_copy, offsetof(_PyStackChunk, previous));
count++;
// Process this chunk
if (process_single_stack_chunk(unwinder, chunk_addr, &chunks[count]) < 0) {
Member


Unless I’m missing something, stopping after a single chunk here looks like a large perf regression for the profiler. The runtime still has a linked chunk chain via stack_chunk_list -> previous, but this now only copies the newest chunk. If the active frame chain spans older chunks, find_frame_in_chunks() misses those frames and we fall back to parse_frame_object(), which does one remote memory read per frame.

@pablogsal
Member

pablogsal commented Apr 27, 2026

Unfortunately the more I think about it the less I like it: this model is harder to reason about than the previous linked-list model, and I think that matters because it creates a very easy footgun: the current chunk looks like the current stack backing store, but it is not.

Existing live frames may still be in older chunks, while newer frames are in the newest chunk at a matching logical offset. So any code that treats stack_chunk_list as “the stack” instead of “the head of a chain that may contain live frames” can silently become wrong. That already seems to happen here: copying only the head chunk misses older live frames and turns what used to be a cheap bulk-copy unwind into many per-frame remote reads.

The risk is not just external profilers. This makes future runtime/debugger code more fragile because pointer validity and frame ownership now depend on searching the whole chunk chain, not checking the current chunk.

@markshannon
Member

@pablogsal
I don't understand why this PR would cause 1000 remote reads. A 1000 frames stack will only span a few chunks, since they grow exponentially. For very large stacks, there will be fewer chunks to copy, not more.
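A quick illustration of the growth argument (the initial per-chunk frame capacity and the doubling factor are assumptions for the example):

```python
# With exponential (here, doubling) chunk growth, the number of chunks
# spanned by a stack grows only logarithmically with its depth.
def chunks_spanned(frames, initial_capacity):
    count, capacity = 0, initial_capacity
    while frames > 0:
        frames -= capacity
        capacity *= 2
        count += 1
    return count

print(chunks_spanned(1000, 100))       # 100+200+400+800 already covers 1000
print(chunks_spanned(1_000_000, 100))  # a million frames still spans few chunks
```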

Existing live frames may still be in older chunks, while newer frames are in the newest chunk at a matching logical offset.

There seems to be a misunderstanding here. No two frames will have the same offset.

This makes future runtime/debugger code more fragile because pointer validity and frame ownership now depend on searching the whole chunk chain, not checking the current chunk.

The stack is a linked list of frames, not chunks. Some of which (generator and coroutine frames) aren't in chunks at all, so tools already need to handle pointers outside of the current chunk.

Overall, I don't see how this really changes anything for an out-of-process profiler: Copy all the chunks, then traverse the stack.

Also, note that the lower, unused part of the current chunk in a stack with multiple chunks will be untouched, so a profiler should be able to detect when it needs to cross to another chunk.

Other profilers like austin currently hard-code the old _PyStackChunk layout (previous, size, top, data), while this patch changes it to (size, previous, data), so those tools need explicit updates.

That is a trade-off that tools and libraries make: either they use stable APIs/ABIs, or they probe CPython internals. If they do the latter, they will need updating every release.

@P403n1x87 would this be a problem for you?

@pablogsal
Member

I don't understand why this PR would cause 1000 remote reads. A 1000 frames stack will only span a few chunks, since they grow exponentially. For very large stacks, there will be fewer chunks to copy, not more.

The issue I was pointing at is the current _remote_debugging implementation in this PR: copy_stack_chunks() now processes only the head chunk and sets out_chunks->count = 1. Once frame->previous points into an older chunk, find_frame_in_chunks() misses and we fall back to parse_frame_object(), which reads the frame remotely. That is where the ~966 remote reads for a 1000-frame stack came from.

The fix here is to either restore eager copying of the full stack_chunk_list -> previous chain using the new _PyStackChunk layout, or lazily copy previous chunks when the frame walk crosses out of the copied head chunk.
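The lazy variant could look roughly like this toy model. The remote-memory representation and helper names are made up for illustration; the real unwinder operates on raw chunk bytes. The point is that a chunk is bulk-copied only the first time the frame walk enters it, so remote reads scale with the number of chunks rather than the number of frames:

```python
# Toy remote memory: chunk id -> list of (frame_name, previous_ref),
# where previous_ref is (chunk_id, index) or None for the bottom frame.
remote_chunks = {
    "old": [("f0", None), ("f1", ("old", 0))],
    "new": [("f2", ("old", 1)), ("f3", ("new", 0))],
}
reads = []  # record one "bulk remote read" per chunk copy

def copy_chunk(chunk_id):
    reads.append(chunk_id)
    return remote_chunks[chunk_id]

def walk_frames(top_ref):
    copied = {}  # local chunk copies, filled lazily
    names = []
    ref = top_ref
    while ref is not None:
        chunk_id, index = ref
        if chunk_id not in copied:
            copied[chunk_id] = copy_chunk(chunk_id)  # first crossing only
        name, ref = copied[chunk_id][index]
        names.append(name)
    return names
```

Walking four frames that span two chunks costs two bulk reads, not four per-frame reads.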

There seems to be a misunderstanding here. No two frames will have the same offset.

Also, you’re right that “same logical offset” was poor wording. I meant that the new chunk starts allocating at the old stack depth, leaving the lower part of the new chunk unused; not that two frames have the same offset.

@pablogsal
Member

Here is a repro:

import os
import subprocess
import sys
import _remote_debugging

DEPTH = 1000

child_code = f"""
import os, sys, time
sys.setrecursionlimit({DEPTH + 1000})

def f(n):
    if n == 0:
        print("READY", os.getpid(), flush=True)
        time.sleep(10)
        return
    f(n - 1)

f({DEPTH})
"""

p = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    text=True,
)

line = p.stdout.readline().strip()
pid = int(line.split()[1])

try:
    unwinder = _remote_debugging.RemoteUnwinder(
        pid,
        all_threads=True,
        cache_frames=False,
        stats=True,
        debug=True,
    )
    trace = unwinder.get_stack_trace()
    frames = sum(len(t.frame_info) for interp in trace for t in interp.threads)

    print("frames:", frames)
    print("stats:", unwinder.get_stats())
finally:
    p.terminate()
    p.wait()

In main we can see:

❯ ./python repro.py
frames: 1002
stats: {'total_samples': 1, 'frame_cache_hits': 0, 'frame_cache_misses': 0, 'frame_cache_partial_hits': 0, 'frames_read_from_cache': 0, 'frames_read_from_memory': 0, 'memory_reads': 4, 'memory_bytes_read': 1200, 'code_object_cache_hits': 1000, 'code_object_cache_misses': 2, 'stale_cache_invalidations': 0, 'frame_cache_hit_rate': 0.0, 'code_object_cache_hit_rate': 99.8003992015968}

with this PR:

frames: 1002
stats: {'total_samples': 1, 'frame_cache_hits': 0, 'frame_cache_misses': 0, 'frame_cache_partial_hits': 0, 'frames_read_from_cache': 0, 'frames_read_from_memory': 0, 'memory_reads': 966, 'memory_bytes_read': 85768, 'code_object_cache_hits': 1000, 'code_object_cache_misses': 2, 'stale_cache_invalidations': 0, 'frame_cache_hit_rate': 0.0, 'code_object_cache_hit_rate': 99.8003992015968}

Notice memory_reads going from 4 to 966.

@dpdani
Contributor Author

dpdani commented Apr 28, 2026

The fix here is to either restore eager copying of the full stack_chunk_list -> previous chain using the new _PyStackChunk layout, or lazily copy previous chunks when the frame walk crosses out of the copied head chunk.

Yes, that's my bad. I made the smallest changes possible to the remote debugging module to get the PR working, but didn't inspect further improvements. Maybe it can be done in a follow-up PR by people more knowledgeable on the module? Or would you consider that a blocker?

@pablogsal
Member

pablogsal commented Apr 28, 2026

Maybe it can be done in a follow-up PR by people more knowledgeable on the module? Or would you consider that a blocker?

This is a blocker. This PR adds a regression and that's not acceptable.

@pablogsal pablogsal force-pushed the gh-142183-stack-resizable-array branch from 8115b28 to 239e2eb Compare April 28, 2026 10:21
@pablogsal
Member

@dpdani I pushed a fix for the concrete _remote_debugging regression: it now walks tstate->stack_chunk_list through previous and copies all stack chunks before traversing frames, instead of only copying the newest chunk. With the fix, the 1000-recursive-frame repro goes from ~966 remote reads / ~85 KiB read to 3 reads.

I also added a regression test to make sure deep stacks are resolved from copied chunks rather than falling back to parsing frames individually from remote memory.

@pablogsal
Member

That said, I still think this solution is very confusing and too complex. It is harder to reason about where frames live, which chunks are relevant, and what invariants the implementation can rely on. I am worried this makes future changes easier to get subtly wrong.

@pablogsal pablogsal force-pushed the gh-142183-stack-resizable-array branch from 239e2eb to 99e9e44 Compare April 28, 2026 10:46
@pablogsal
Member

pablogsal commented Apr 28, 2026

I investigated an alternative implementation in #149097: instead of replacing the stack with the resizable-array model, it keeps the current chunked-stack invariants and extends the existing one-chunk cache into a small bounded per-thread cache.

I think this is a better direction because it fixes the allocator-thrashing issue without changing where live frames can reside, without changing _PyStackChunk/PyThreadState layout, and without requiring _remote_debugging or external unwinders to learn a new stack model. Active frames remain only in the datastack_chunk -> previous chain; cached chunks are detached and inactive.

The cache is intentionally bounded: currently up to 8 * _PY_DATA_STACK_CHUNK_SIZE, so the memory cost is predictable and much smaller than retaining a high-water-mark stack for the lifetime of the thread.
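The bounded-cache idea can be sketched like this (constants and names here are illustrative stand-ins; see #149097 for the actual implementation):

```python
CHUNK_SIZE = 16 * 1024   # assumed chunk size for the sketch
MAX_CACHED = 8           # mirrors the 8 * _PY_DATA_STACK_CHUNK_SIZE bound

class ChunkCache:
    """Per-thread cache of detached, inactive stack chunks."""

    def __init__(self):
        self._free = []

    def allocate(self):
        # Reuse a cached chunk when possible instead of asking the OS,
        # avoiding mmap/munmap churn at chunk boundaries.
        return self._free.pop() if self._free else bytearray(CHUNK_SIZE)

    def release(self, chunk):
        # Keep at most MAX_CACHED chunks; beyond that the chunk is
        # dropped (freed), so memory overhead stays bounded.
        if len(self._free) < MAX_CACHED:
            self._free.append(chunk)
```

Because the cap is fixed, the worst-case extra memory is MAX_CACHED * CHUNK_SIZE regardless of how deep the stack has ever been.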

I checked the original repro and it no longer shows per-branch mmap/munmap churn.

